Lexicon Development and POS Tagging Using a Tagged Bengali News Corpus

Authors

  • Asif Ekbal
  • Sivaji Bandyopadhyay
Abstract

Lexicon development and Part of Speech (POS) tagging are important for almost all Natural Language Processing (NLP) application areas. Rapidly developing these resources and tools with machine learning techniques for less computerized languages requires an appropriately tagged corpus. A tagged Bengali news corpus has been developed from the web archive of a widely read Bengali newspaper. This corpus is then used for lexicon development and POS tagging.

Tagged Bengali News Corpus Development

Newspapers are a huge source of readily available documents. A tagged corpus has been developed from the web archive of a very well-known and widely read Bengali newspaper. The development of the tagged Bengali news corpus has three stages: language resource acquisition using a web crawler; language resource creation, which includes HTML file cleaning and code conversion; and language resource annotation, which involves defining a tag set and then tagging the news corpus. Code conversion is necessary to convert the dynamic fonts used in the newspaper into the Indian Standard Code for Information Interchange (ISCII) form, which can be processed for various text processing tasks. At present, the corpus contains 34 million wordforms and is available in both ISCII and UTF-8 formats.

A news corpus, whether in Bengali or in any other language, has different parts such as title, date, reporter, location, and body. To identify these parts in a news corpus, the following tagset has been defined:

  • header: header of the news document
  • title: headline of the news document
  • t1: 1st headline of the title
  • t2: 2nd headline of the title
  • date: date of the news document
  • bd: Bengali date
  • day: day
  • ed: English date
  • reporter: reporter name
  • agency: agency providing the news
  • location: the news location
  • body: body of the news document
  • p: paragraph
  • table: information in tabular form
  • tc: table column
  • tr: table row
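To illustrate how a document annotated with this tagset might be consumed, here is a minimal Python sketch. The abstract does not specify the markup syntax or how the tags nest, so the XML-like angle-bracket form, the `doc` wrapper element, the nesting shown in `SAMPLE`, and the `parse_news` helper are all assumptions made for illustration; only the tag names come from the tagset above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. Only the tag names come from the tagset;
# the <doc> wrapper, the nesting, and the contents are illustrative assumptions.
SAMPLE = """<doc>
<header>
<title><t1>First headline</t1><t2>Second headline</t2></title>
<date><bd>Bengali date text</bd><day>Monday</day><ed>01.01.2007</ed></date>
<reporter>Reporter Name</reporter>
<agency>Agency Name</agency>
<location>Kolkata</location>
</header>
<body><p>First paragraph.</p><p>Second paragraph.</p></body>
</doc>"""


def parse_news(xml_text):
    """Extract a few tagged parts of one news document into a dict."""
    root = ET.fromstring(xml_text)

    def text(tag):
        # Text of the first element with this tag anywhere in the tree, or None.
        value = root.findtext(f".//{tag}")
        return value.strip() if value else None

    return {
        "headlines": [h for h in (text("t1"), text("t2")) if h],
        "english_date": text("ed"),
        "reporter": text("reporter"),
        "agency": text("agency"),
        "location": text("location"),
        "paragraphs": [(p.text or "").strip() for p in root.iter("p")],
    }


record = parse_news(SAMPLE)
print(record["headlines"])
print(record["paragraphs"])
```

If the real corpus uses a different delimiter style or nesting, only `SAMPLE` and the lookup paths would need to change.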
Copyright © 2007, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Lexicon Development from the Corpus

The tagged Bengali news corpus has been used to develop a Bengali lexicon: a list of Bengali root words derived from the corpus, each with its basic part of speech information. An unsupervised learning method has been used for the lexicon development. No extensive knowledge of the language is required beyond the different inflections that can appear with the different words in Bengali. Bengali has five parts of speech: noun, pronoun, verb, adjective and indeclinable (postpositions, conjunctions, and interjections). Noun, verb and adjective are the open classes of part of speech for Bengali.

Initially, all the words (inflected and uninflected) are extracted from the tagged corpus and added to the database. A list of inflections that may appear with nouns is kept; at present it has 27 entries. Bengali verbs can be organized into 20 groups according to their spelling patterns and the inflections that can be attached to them. The original word-form of a verb often changes when a suffix is attached to it. At present, the verb inflection list has 214 entries.

Noun and verb words are tagged by looking at their inflections. Some inflections are common to both nouns and verbs; in these cases, more than one root word is generated for a wordform. The POS ambiguity is resolved by checking the number of occurrences of these candidate root words, along with the POS tags derived from the other wordforms. Pronouns and indeclinables are essentially closed classes of part of speech in Bengali and are added to the lexicon manually.

It has been observed that adjectives in Bengali generally occur in four different forms based on the suffixes attached.
The first type of adjective forms the comparative and superlative degrees by attaching the suffixes -tara and -tamo to the adjective word. These adjective stems are stored in the lexicon with adjective POS. The second set of suffixes (e.g., -gato, -karo) identifies the POS of the wordform as adjective only if the lexicon contains a noun entry for the desuffixed word. The third group of suffixes (e.g., -janok, -sulav) identifies the POS of the wordform as adjective, and the desuffixed word is included in the lexicon with noun POS. The last set of suffixes identifies the POS of the wordform as adjective.

Hidden Markov Model Based POS Tagging

A POS tagger based on a modified Hidden Markov Model (HMM) has been developed using a portion of the tagged Bengali news corpus. We have used a tagset having 27 dif-
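As a rough illustration of the four adjective suffix groups described in the lexicon section, the following Python sketch applies each rule in turn. It is not the authors' implementation. The suffixes in the first three groups are the examples given in the text; the fourth group is left empty because the abstract lists no examples for it. The `tag_adjective` and `strip_suffix` helpers, the dictionary-of-sets lexicon layout, and the stems `foo`, `bar`, `baz`, and `qux` are made-up placeholders, not real Bengali words.

```python
DEGREE_SUFFIXES = ("tara", "tamo")      # group 1: comparative / superlative
NOUN_CHECK_SUFFIXES = ("gato", "karo")  # group 2: needs an existing noun entry
NOUN_ADD_SUFFIXES = ("janok", "sulav")  # group 3: desuffixed word becomes a noun
PLAIN_ADJ_SUFFIXES = ()                 # group 4: no examples given in the source


def strip_suffix(word, suffixes):
    """Return the word without its matching suffix, or None if none matches."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return None


def tag_adjective(word, lexicon):
    """Return 'adjective' if a suffix rule fires, else None.

    lexicon maps a root word to a set of POS labels and may be updated.
    """
    stem = strip_suffix(word, DEGREE_SUFFIXES)
    if stem is not None:
        # Group 1: store the stem itself with adjective POS.
        lexicon.setdefault(stem, set()).add("adjective")
        return "adjective"

    stem = strip_suffix(word, NOUN_CHECK_SUFFIXES)
    if stem is not None:
        # Group 2: adjective only if the desuffixed word is a known noun.
        return "adjective" if "noun" in lexicon.get(stem, set()) else None

    stem = strip_suffix(word, NOUN_ADD_SUFFIXES)
    if stem is not None:
        # Group 3: adjective, and the desuffixed word is added as a noun.
        lexicon.setdefault(stem, set()).add("noun")
        return "adjective"

    if strip_suffix(word, PLAIN_ADJ_SUFFIXES) is not None:
        # Group 4: the suffix alone marks the wordform as an adjective.
        return "adjective"
    return None


lexicon = {"bar": {"noun"}}
results = {w: tag_adjective(w, lexicon)
           for w in ("footara", "bargato", "bazgato", "quxjanok")}
print(results)
print(lexicon)
```

Here `bazgato` stays untagged because `baz` has no noun entry, while `quxjanok` both receives the adjective tag and adds `qux` to the lexicon as a noun.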





Publication date: 2007